Chromosome-level assemblies of cultivated water chestnut Trapa bicornis and its wild relative Trapa incisa

Water chestnut (Trapa L.) is a floating-leaved aquatic plant with high edible and medicinal value. In this study, we present chromosome-level genome assemblies of the cultivated large-seed species Trapa bicornis and its wild small-seed relative Trapa incisa, generated using PacBio HiFi long reads and Hi-C technology. The T. bicornis and T. incisa assemblies consisted of 479.90 Mb and 463.97 Mb of contigs with N50 values of 13.52 Mb and 13.77 Mb, and repeat contents of 62.88% and 62.49%, respectively. A total of 33,306 and 33,315 protein-coding genes were predicted in the T. bicornis and T. incisa assemblies, respectively. We detected 159,232 structural variants between the two genomes, affecting more than 11 thousand genes. Phylogenetic analysis indicated that the lineage leading to Trapa diverged from the lineage leading to Sonneratia approximately 23 million years ago. These two assemblies provide valuable resources for future evolutionary and functional genomic research and for molecular breeding of water chestnut.


Background & Summary
Trapa L., known as water chestnut or water caltrop, is the only genus of Trapaceae. Although the Angiosperm Phylogeny Group (APG) IV system places Trapaceae within Lythraceae, the name "Trapaceae" is still used by some scholars because of a handful of morphological differences between the two families 1 . Trapa plants are annual floating-leaved herbs that grow naturally in temperate, subtropical and tropical regions of the Old World and are invasive in Australia and North America 2 . They reproduce sexually and/or asexually and have a high degree of autogamy 3,4 . The genus has two diversity centers: the Yangtze River Basin (central China) and the Amur River-Tumen River Basin (on the border between China and Russia) 5 . Trapa plants have high edible value because of their large starchy seeds, which have a long history of consumption. In China, archaeological studies have found that water chestnut was widely eaten during the Neolithic Age (7000-2000 BC), with 21 unearthed sites in the basins of the Yellow River and Yangtze River 6 . In ancient Europe, inhabitants also gathered water chestnut seeds as part of their diet between 4000 and 1000 BC 7 . The cultivation of water chestnut can be traced back to the Tang (618-907 AD) and Song (960-1279 AD) dynasties 8 in the middle and lower reaches of the Yangtze River. At present, it is an important aquatic crop widely grown in China and India 9 . Additionally, the tender seeds, stems and leaves of Trapa are eaten as vegetables because of their fresh and sweet taste, whereas the seed pericarps are used in traditional Chinese medicine because they contain bioactive components used in the treatment of cancer, inflammation and atherosclerosis 10-12 . Furthermore, Trapa has significant ecological value in improving water quality because of its strong absorption capacity for heavy metals and pollutants 13 .
A better understanding of species identification, evolutionary relationships and genetic information will greatly facilitate the effective management and sustainable utilization of wild plant resources. However, the classification of Trapa species is still open to debate because their vegetative organs are morphologically similar and their seeds are highly variable. Some scholars argued that the genus contained more than 20, 30 or 70 species, while others merged them into one or two polymorphic species 14 . Quantitative taxonomic studies based on morphological variation showed that Trapa species with similar seed sizes were closely related, and that all species fell into two branches, the large- and small-seed clusters 15 . This was well supported by molecular studies based on chloroplast (cp) sequences 14,16 . The cp genome analysis also showed that both the geographical origin and the tubercle morphology of seeds were of great significance for deducing relationships within Trapa 14 . Cytological studies showed two different chromosome numbers in Trapa (2n = 2x = 48 and 2n = 4x = 96) and suggested that the tetraploid might be a hybrid of diploids 17 , which was supported by molecular analyses based on allozymes as well as nuclear and chloroplast DNA sequences 18,19 . The existence of two distinct subgenomes was directly confirmed by the recently published chromosome-level assembly of a tetraploid Trapa natans (AABB) genome 8 . Furthermore, resequencing data showed that large-seed species include both diploids (2n = 2x = 48, AA) and tetraploids (2n = 4x = 96, AABB), whereas small-seed species include only diploids (2n = 2x = 48, BB) 8 . However, genome sequences of representatives of the 'AA' and 'BB' genomes are not yet available, even though such species are very common in the genus.
Here, we sequenced the genomes of the typical cultivated species Trapa bicornis Osbeck (AA) and a small-seed species, Trapa incisa Sieb. et Zucc. (BB), which should deepen the understanding of Trapa diversity and the origin of tetraploid Trapa. De novo assembly using PacBio high-fidelity (HiFi) long reads generated 479.90 and 463.97 Mb of contigs for T. bicornis and T. incisa with N50 values of 13.51 and 13.77 Mb, respectively. After scaffolding with Hi-C reads, 98.0% and 98.1% of the contigs were anchored into 24 pseudo-chromosomes for T. bicornis and T. incisa, respectively. We predicted 33,306 and 33,315 protein-coding genes in the T. bicornis and T. incisa genomes, respectively. Despite good collinearity, 159,232 structural variations (SVs) were identified between the genomes of T. bicornis and T. incisa, overlapping with more than 11 thousand genes. Divergence time estimation indicated that T. bicornis and T. incisa diverged around 1.51 million years ago. The two genomes provide baseline information on the diversity of Trapa species, which will facilitate functional genomic analysis and molecular breeding of water chestnut.

Methods
Sample collection and sequencing. Seeds of T. bicornis and T. incisa were collected from Honghu (29.39°N/113.07°E), Hubei province, China (Fig. 1). Plants were cultured outdoors from March to July in water tanks in Wuhan Botanical Garden, Chinese Academy of Sciences, Hubei province, China. Ninety-day-old individuals of each species were used for DNA and RNA extraction.
Genomic DNA was isolated from fresh young leaves using the cetyltrimethylammonium bromide (CTAB) method 20 . A total of 1.5 µg DNA per sample was used as input material for Illumina paired-end library construction. Each library, with an average insert size of 350 bp, was generated using the TruSeq Nano DNA HT Sample Preparation Kit (Illumina, USA) following the manufacturer's instructions. The libraries were sequenced on the Illumina HiSeq X Ten system. A total of 125.97 Gb and 53.14 Gb of paired-end reads (PE150), covering roughly 183.38× and 112.42× of the genomes, were generated for T. bicornis and T. incisa, respectively (Table 1).
For PacBio long-read sequencing, about 10 µg of genomic DNA was sheared into fragments of 10-20 kb using a g-TUBE (Covaris, USA). The fragmented DNA was purified with AMPure PB magnetic beads. High-fidelity (HiFi) libraries were generated using the SMRTbell Express Template Prep Kit 2.0 and sequenced on the PacBio Sequel IIe platform (Pacific Biosciences, Menlo Park, USA). A total of 24.11 Gb and 20.42 Gb of HiFi reads with N50 sizes of 17,588 bp and 13,963 bp were obtained using the CCS (Circular Consensus Sequencing) software with default parameters (https://ccs.how/), which covered 49.23× and 43.20× of the T. bicornis and T. incisa genomes, respectively (Table 1).
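The read and contig N50 values quoted in this paper follow the standard definition: the length L such that sequences of length L or longer contain at least half of all bases. The short Python sketch below illustrates that definition; the function name `n50` and the toy lengths are illustrative assumptions and are not taken from the authors' pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp).

    N50 is the length of the sequence at which the running total,
    taken from the longest sequence downwards, first reaches half
    of the total number of bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example (hypothetical read lengths, not real data).
toy_reads = [2_000, 5_000, 8_000, 10_000, 15_000]
print(n50(toy_reads))  # 10000
```

For the toy input the total is 40,000 bp; the two longest reads together contain 25,000 bp, which is the first running total to reach 20,000 bp, so the N50 is 10,000 bp.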
The high-throughput chromosome conformation capture (Hi-C) libraries were constructed using 5 µg DNA. DNA crosslinking was performed with 4% formaldehyde. The crosslinked DNA was digested with the DpnII restriction endonuclease, labelled with biotin-14-dCTP and then ligated with T4 DNA ligase. The ligated DNA was [...] (Table 2). The cleaned Hi-C reads were mapped to the corresponding contigs using Juicer v1.9.9 22 . The uniquely mapped reads were taken as input for the 3D-DNA pipeline v180114 23 with the parameter "-r 0", and the resulting scaffolds were then sorted and corrected manually using JuiceBox v1.11.08 24 . Finally, 24 pseudo-chromosomes were obtained for each genome, which contained 98.01% and 98.14% of the assembled contigs for T. bicornis and T. incisa, respectively (Fig. 2).
We assessed the completeness of the genomes using BUSCO v5.0 (Benchmarking Universal Single-Copy Orthologs) 25 [...] 26 . Based on the Illumina PE150 reads, we assessed the consensus quality values (QVs) of the two assemblies using Merqury v2020-01-29 27 with a k-mer size of 20. For the T. bicornis and T. incisa assemblies, the read mapping rates were 99.88% and 99.61%, and the QVs were 49.70 and 43.91, respectively (Table 2). These evaluations indicated that the two genome assemblies have high completeness, contiguity and accuracy.
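To make the QV figures easier to interpret, the sketch below converts a Phred-scaled QV into a per-base error rate and shows a k-mer-based QV estimate in the style described for Merqury, where k-mers found only in the assembly (and not in the reads) are treated as evidence of consensus errors. The function names and the example k-mer counts are assumptions for illustration; the exact computation performed by the Merqury release used in this study should be checked against its documentation.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled consensus quality value to a per-base error rate."""
    return 10 ** (-qv / 10)


def kmer_based_qv(asm_only_kmers, total_asm_kmers, k):
    """Estimate a consensus QV from k-mer counts (Merqury-style sketch).

    asm_only_kmers  -- k-mers present in the assembly but absent from the reads
    total_asm_kmers -- all k-mers in the assembly
    k               -- k-mer size (20 in this study)
    """
    if total_asm_kmers <= 0:
        raise ValueError("total_asm_kmers must be positive")
    if asm_only_kmers == 0:
        return math.inf  # no unsupported k-mers observed
    # Probability that a single base is correct, assuming a k-mer is
    # supported by the reads only if all k of its bases are correct.
    p_base_correct = (1 - asm_only_kmers / total_asm_kmers) ** (1 / k)
    error_rate = 1 - p_base_correct
    return -10 * math.log10(error_rate)


for species, qv in [("T. bicornis", 49.70), ("T. incisa", 43.91)]:
    print(f"{species}: QV {qv} -> error rate {qv_to_error_rate(qv):.2e}")
```

Under this conversion, the reported QVs of 49.70 and 43.91 correspond to per-base error rates of about 1.1 × 10^-5 and 4.1 × 10^-5, respectively.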
Genome annotation. Custom repeat libraries for each genome were constructed by screening the genome using LTR_finder 28 , ltrharvest 29 and RepeatModeler-2.0.2a 30 . RepeatMasker (www.repeatmasker.org) was used to identify repeat sequences based on the custom libraries. A total of 307.95 Mb (62.88%) and 295.42 Mb (62.49%) of repetitive sequences were annotated in the T. bicornis and T. incisa genomes, respectively (Table 3).
www.nature.com/scientificdata
For protein-coding gene annotation, we combined RNA-seq-based, ab initio and homology-based predictions to identify gene models. The clean RNA-seq reads were aligned to the assemblies using HISAT2 v2.2.1 33 , and the alignments were then assembled into transcripts in GTF format with StringTie2 v2.1.6 34 . TransDecoder v5.5.0 35 was then used to identify open reading frames (ORFs) and refine exon boundaries. The ab initio gene predictions were generated by three de novo prediction programs: Augustus-3.3.3 36 , SNAP v2006-07-28 37 and GlimmerHMM 3.0.4 38,39 . Proteins from Punica granatum 40 , Arabidopsis thaliana TAIR10 41 , Eucalyptus grandis 42 , Melaleuca alternifolia 43 and tetraploid Trapa natans 8 were aligned to the genomes using TBLASTN 44 . The homologous genes were identified using Exonerate v2.2.0 45 . The RNA-seq evidence, ab initio predictions and homology evidence were fed to MAKER v3.01 46 to generate the final gene set. A total of 33,306 and 33,315 protein-coding genes were predicted in the T. bicornis and T. incisa genomes, respectively.

Table 3. Genome annotation of repetitive sequences and protein-coding genes.

Variations between the T. bicornis and T. incisa genomes. Single nucleotide polymorphisms (SNPs) between the genomes of T. bicornis and T. incisa were detected by aligning the two assemblies using NUCmer from MUMMER4 48 . We set the minimum alignment length to 100 bp and retained the uniquely matching fragments. A total of 9,449,234 SNPs were identified with the show-snps tool from MUMMER4 48 (Fig. 3).
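The SNP-calling steps described above can be sketched as a three-command MUMmer4 workflow. The exact command lines used in the study are not given in the text, so the sketch below is one plausible reading: `delta-filter -1 -l 100` is assumed to implement the 100 bp length threshold and the unique-match filtering, and `show-snps -C` is assumed for excluding SNPs in ambiguously mapped regions. The file names and the `prefix` value are hypothetical. The function only builds the commands and does not execute them.

```python
def snp_calling_commands(ref_fasta, qry_fasta,
                         prefix="bicornis_vs_incisa", min_len=100):
    """Build (argv, stdout_file) pairs for a NUCmer-based SNP workflow.

    All file names are hypothetical placeholders; the flags reflect an
    assumed interpretation of the methods text, not the authors' record.
    """
    delta = f"{prefix}.delta"
    filtered = f"{prefix}.filtered.delta"
    return [
        # 1. Whole-genome alignment; writes <prefix>.delta.
        (["nucmer", "-p", prefix, ref_fasta, qry_fasta], None),
        # 2. Keep one-to-one alignments of at least min_len bp.
        (["delta-filter", "-1", "-l", str(min_len), delta], filtered),
        # 3. Report SNPs from unambiguous alignments, tab-delimited.
        (["show-snps", "-C", "-T", filtered], f"{prefix}.snps"),
    ]


for argv, stdout_file in snp_calling_commands("T_bicornis.fa", "T_incisa.fa"):
    redirect = f" > {stdout_file}" if stdout_file else ""
    print(" ".join(argv) + redirect)
```

Each pair could be passed to `subprocess.run(argv, stdout=open(stdout_file, "w"), check=True)` on a system where MUMmer4 is installed; keeping command construction separate from execution makes the assumed parameters easy to inspect and adjust.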
To identify SVs, the T. incisa genome was mapped to the T. bicornis genome using Minimap2 49 with the parameter "-ax asm5". Assemblytics was used to extract unique alignments and identify SVs from them 50 . The final SVs were classified into seven categories: deletion, insertion, repeat contraction, repeat expansion, tandem contraction, tandem expansion and substitution. Protein-coding genes overlapping with SV regions were retrieved with BEDTools v2.29.1 51 . A total of 159,232 SVs were identified between the T. bicornis and T. incisa genomes, which accounted for 110.49 Mb and 140.13 Mb of sequence in the two genomes, respectively (Table 4). These SVs overlapped with 11,265 and 11,621 genes of the two Trapa genomes, respectively.
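The gene-SV overlap step amounts to an interval intersection per chromosome. The sketch below shows the underlying logic in plain Python using BED-style 0-based, half-open coordinates; it is an illustration of the idea rather than the authors' BEDTools invocation, which is not given in the text. The function name, gene identifiers and coordinates are hypothetical.

```python
from collections import defaultdict


def genes_overlapping_svs(genes, svs):
    """Return the IDs of genes that overlap at least one SV.

    genes -- iterable of (chrom, start, end, gene_id)
    svs   -- iterable of (chrom, start, end)
    Coordinates are 0-based and half-open, as in BED files.
    """
    svs_by_chrom = defaultdict(list)
    for chrom, start, end in svs:
        svs_by_chrom[chrom].append((start, end))

    hits = set()
    for chrom, g_start, g_end, gene_id in genes:
        for s_start, s_end in svs_by_chrom.get(chrom, []):
            # Two half-open intervals overlap if each starts before
            # the other one ends.
            if s_start < g_end and g_start < s_end:
                hits.add(gene_id)
                break
    return hits


# Toy example (hypothetical genes and SVs).
toy_genes = [
    ("Chr01", 100, 500, "geneA"),
    ("Chr01", 1000, 1500, "geneB"),
    ("Chr02", 100, 500, "geneC"),
]
toy_svs = [("Chr01", 450, 600), ("Chr02", 500, 700)]
print(sorted(genes_overlapping_svs(toy_genes, toy_svs)))  # ['geneA']
```

In the toy data, geneC is not reported because the SV on Chr02 starts exactly where the gene ends, which does not count as an overlap under half-open coordinates. The nested loop is quadratic per chromosome and is only meant to show the logic; for genome-scale data a sorted or indexed approach such as BEDTools is preferable.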
Synteny between the published tetraploid T. natans genome and the two diploid Trapa genomes. Our new assemblies provide a valuable resource for investigating the origin of the Trapa tetraploid and the genomic changes after polyploidization. The genomes of T. bicornis and T. incisa and the two subgenomes of the published tetraploid genome were aligned pairwise using MUMMER4 48 (Fig. 4). The syntenic regions were extracted from the alignments with syri-1.4 52 . The T. bicornis and T. incisa genomes shared the highest percentage of syntenic regions with the A and B subgenomes of T. natans, respectively, suggesting that the two diploid genomes represent the ancestral genomes of the A and B subgenomes, respectively.

Data records
The raw data of Illumina PE150 reads, PacBio HiFi long reads and Hi-C reads from T. bicornis were submitted to the National Center for Biotechnology

Technical Validation
The quality scores across all bases and the GC content of the Illumina raw sequencing data were inspected with FastQC v0.11.9 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). The contig-level and chromosome-level assemblies were assessed in four ways: N50 for contiguity, QV for accuracy, BUSCO for completeness and paired-end read mapping rate for consistency with the raw data. The protein-coding genes were validated by BUSCO scores and by annotation against functional databases. In the phylogenetic tree, every branch received 100% bootstrap support.

Phylogenetic tree with estimated divergence times. The maximum likelihood tree was constructed based on 1,106 single-copy orthologous genes. The red dots at the nodes indicate that the values were supported by fossil evidence.